This post is the first in a series about assessing similarity between time series. More specifically, we will be interested in alignment-based metrics, Here we use the term “metrics” in a pretty unformal manner, that is an equivalent of “similarity measure”. that rely on a temporal alignment of the series in order to assess their similarity.
Before entering into more details about these metrics, let us define our base objects: time series. In the following, a time series is a sequence of features: \(x = \left(x_0, \dots, x_{n-1}\right)\). All features from a time series lie in the same space \(\mathbb{R}^p\). Below is an example univariate A time series is said univariate if all its feature vectors are monodimensional (\(p=1\)). time series:
Let us now illustrate the typical behavior of alignment-based metrics with an example.
Here, we are computing similarity between two time series using either Euclidean distance (left) or Dynamic Time Warping (DTW, right), which is an instance of alignment-based metric, that we will present in more details later in this post. In both cases, the returned similarity is the sum of distances over all matches (represented by gray lines here). Note how DTW matches distinctive patterns of the time series, which is likely to result in a more sound similarity assessment.
The figure above is the result of a \(k\)-means clustering that uses Euclidean distance as a base metric. One issue with this metric is that it is not invariant to time shifts, while the dataset at stake clearly holds such invariants.
When using a shift-invariant similarity measure (discussed in our {ref}sec:dtw section) at the core of \(k\)-means, one gets:
This part of the course tackles the definition of adequate similarity measures for time series and their use at the core of machine learning methods.
Dynamic Time Warping (DTW) is a similarity measure between time series. It has been introduced independently in the literature by [Vint68] and [SaCh78], in both cases for speech applications. Note that, in this series of posts, we will stick to the formalism from [SaCh78], which is more standard in the literature.
Consider two time series \(x\) and \({x}^\prime\) of respective lengths \(n\) and \(m\).
Here, all elements \(x_i\) and \(x^\prime_j\) are assumed to lie in the same \(p\)-dimensional space and the exact timestamps at which observations occur are disregarded: only their ordering matters.
Dynamic Time Warping, in this context, is equivalent to minimizing Euclidean distance between aligned time series under all admissible temporal alignments, as illustrated in the Figure below.
More formally, the optimization problem writes:
\[\begin{equation} DTW_q({x}, {x}^\prime) = \min_{\pi \in \mathcal{A}({x}, {x}^\prime)} \left( \sum_{(i, j) \in \pi} d(x_i, x^\prime_j)^q \right)^{\frac{1}{q}} \label{eq:dtw} \end{equation}\]
Here, an alignment path \(\pi\) of length \(K\) is a sequence of \(K\) index pairs \(\left((i_0, j_0), \dots , (i_{K-1}, j_{K-1})\right)\) and \(\mathcal{A}({x}, {x}^\prime)\) is the set of all admissible paths. In order to be considered admissible, a path should satisfy the following conditions:
Beginning (resp. end) of time series are matched together:
The sequence is monotonically increasing in both \(i\) and \(j\) and all time series indexes should appear at least once, which can be written:
Another way to represent a DTW path is to use a binary matrix whose non-zero entries are those corresponding to a matching between time series elements. This representation is related to the index sequence representation used above through:
\[\begin{equation} (A_\pi)_{i,j} = \left\{ \begin{array}{rl} 1 & \text{ if } (i, j) \in \pi \\ 0 & \text{ otherwise} \end{array} \right. . \end{equation}\]
This is illustrated in the Figure below where the binary matrix is represented as a grid on which the DTW path \(\pi\) is superimposed, and each dot on the grid corresponds to a non-zero entry in \(A_\pi\):
Using matrix notation, Dynamic Time Warping can be written as the minimization of a dot product between matrices:
\[\begin{equation*} DTW_q({x}, {x}^\prime) = \min_{\pi \in \mathcal{A}({x}, {x}^\prime)} \left\langle A_\pi, D_q({x}, {x}^\prime) \right\rangle^{\frac{1}{q}} \end{equation*}\]
where \(D_q({x}, {x}^\prime)\) stores distances \(d(x_i, x^\prime_j)\) at the power \(q\).
An exact solution to this optimization problem can be found using Dynamic Programming. The essence of dynamic programming is to link the solution of a given problem to solutions of (easier) sub-problems. Once this link is known, one can solve the original problem by recursively solving required sub-problems and storing their solutions.
In the case of DTW, we need to rely on the following quantity:
\[ \gamma_{i,j} = DTW_q({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})^q \]
where the notation \({x}_{\rightarrow i}\) denotes time series \({x}\) observed up to timestamp \(i\) (included). Then, we can observe that:
\[ \begin{aligned} \gamma_{i,j} &= \min_{\pi \in \mathcal{A}({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi} d(x_k, x^\prime_l)^q \\ &\stackrel{*}{=} d(x_i, x^\prime_j)^q + \min_{\pi \in \mathcal{A}({x}_{\rightarrow i}, {x}^\prime_{\rightarrow j})} \sum_{(k, l) \in \pi[:-1]} d(x_k, x^\prime_l)^q \\ &\stackrel{**}{=} d(x_i, x^\prime_j)^q + \min (\gamma_{i-1, j}, \gamma_{i, j-1}, \gamma_{i-1, j-1}) \end{aligned} \]
Note that \((*)\) comes from the constraints on admissible paths \(\pi\): the last element on an admissible path needs to match the last elements of the series; Note also that \((**)\) comes from the contiguity conditions on the admissible paths. Indeed, a path that would align time series \({x}_{\rightarrow i}\) and \({x}^\prime_{\rightarrow j}\) necessarily encapsulates either:
as illustrated in the Figure below:
TODO: fig DTW transitions
This implies that filling a matrix that would store \(\gamma_{i,j}\) terms row-by-row In practice, the matrix could be filled column-by-column too. The important part is that the terms gamma[i-1, j], gamma[i, j-1] and gamma[i-1, j-1] are accessible when computing gamma[i, j]. When vectorizing code is of importance, an even better strategy is to compute the gamma terms one anti-diagonal at a time [TrDe20]. is sufficient to retrieve \(DTW_q({x}, {x}^\prime)\) as \({\gamma_{n-1, m-1}}^{\frac{1}{q}}\).
These observations result in the following \(O(mn)\) algorithm to compute the exact optimum for DTW (assuming computation of \(d(\cdot,\cdot)\) is \(O(1)\)):
def dtw(x, x_prime, q=2):
for i in range(len(x)):
for j in range(len(x_prime)):
gamma[i, j] = d(x[i], x_prime[j]) ** q
if i > 0 or j > 0:
gamma[i, j] += min(
gamma[i-1, j ] if i > 0 else inf,
gamma[i , j-1] if j > 0 else inf,
gamma[i-1, j-1] if (i > 0 and j > 0) else inf
)
return gamma[-1, -1] ** (1. / q)
Dynamic Time Warping holds the following properties:
However, mathematically speaking, DTW is not a valid metric since it satisfies neither the triangular inequality nor the identity of indiscernibles.
As we have seen, Dynamic Time Warping is invariant to time shifts, whatever their magnitude. In order to allow invariances to local deformations only, one can impose additional constraints on the set of acceptable paths. Such constraints typically translate into enforcing nonzero entries in admissible \(A_\pi\) to stay close to the diagonal.
TODO: visu path matrices
The Sakoe-Chiba band [SaCh78] is parametrized by a radius \(r\) (also called warping window size sometimes), while the Itakura parallelogram [Itak75] sets a maximum slope \(s\) for alignment paths, which leads to a parallelogram-shaped constraint. As shown in the Figure below, setting global constraints on admissible DTW paths is equivalent to restricting the set of possible matches for each element in a time series. The number of possible matches for an element is always \(2r+1\) for Sakoe-Chiba constraints (except for border elements), while it varies depending on the time index for Itakura parallelograms.
As seen in the Figure below, DTW with a Sakoe-Chiba band constraint of radius \(r\) is invariant to time shifts of magnitude up to \(r\), but is no longer invariant to longer time shifts.
We have seen in this post how alignment-based metrics can prove useful when dealing with temporally shifted time series. We have presented in more details the most common of these metrics, which is Dynamic Time Warping (DTW). If you enjoyed this post, stay tuned, a new one should be published soon on the specific topic of the differentiability of DTW.